In [1]:
from IPython.display import HTML
HTML('''
<script
    src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js ">
</script>
<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
''')
Out[1]:


Table of Contents:

  • Abstract
  • Problem Statement
  • Motivation
  • Scope and Limitations
  • Data Source
  • Methodology
  • Initial Data Preprocessing
  • Data Exploration
  • Focused Data Preprocessing
  • Data Analysis
  • Results and Discussion
  • Conclusion and Insights
  • Recommendations
  • References
  • Acknowledgements

Abstract


The research investigates the impact of the Oscars on the Wikipedia pageviews of related parties, including actors, production staff, and the films themselves. The study employs a methodology involving data preprocessing, exploration, and analysis using Spark to handle large datasets efficiently. Initial preprocessing converted Wikipedia pageview data into a more manageable parquet format. The analysis revealed a significant increase in pageviews for actors and films around the Oscars event, confirming the initial hypothesis of the Oscars' impact. However, the production staff did not experience the same level of recognition. The study highlights disparities in public attention and suggests areas for further research, such as integrating clickstream data to understand the sources of pageviews and employing cluster computing for more efficient data processing. This work underscores the value of distributed computing in handling extensive datasets and provides a foundation for future studies on the visibility and recognition of various contributors in the film industry.

Back to Table of Contents

Problem Statement


The Academy Awards, a prestigious celebration of cinematic excellence, is often perceived as a catalyst for heightened recognition and fame for the films, actors, and directors involved. However, the true extent of its impact remains a subject of inquiry. To delve into this question, we examine the effects of the 95th Academy Awards using Wikipedia page views as a proxy for public interest and attention.

By analyzing the changes in Wikipedia page views before and after the awards ceremony, we can gauge the magnitude of the "Oscar bump" and determine which categories and nominees experienced the most significant surges in public engagement. This analysis will provide valuable insights into the role of the Academy Awards in shaping public perception and generating interest in the film industry.

Back to Table of Contents

Motivation


The Academy Awards hold a prominent position in popular culture, shaping public discourse and influencing trends within the entertainment industry. A nomination or win can transform an actor, director, studio, and/or film from relative obscurity into a household name.

An Academy win often coincides with a rise to stardom, increased marketing traction, and heightened cultural influence. Understanding these concepts is crucial for both industry insiders and fans alike. However, there is currently no comprehensive method to understand the trajectory of an artist or film post-Academy Awards win.

In today’s digital era, these behaviors are often reflected online. Avid fans or curious spectators frequently search for details about winning films. Most, if not all, land on one well-known page – Wikipedia. The website holds a wealth of information on a wide range of topics, including the intricate histories and achievements of films and their creators.

Analyzing Wikipedia page views could provide valuable insights. As a significant source of big data, Wikipedia offers detailed statistics on page visits that reflect public interest and engagement. By examining patterns in page views before and after the Academy Awards, researchers can identify how winning impacts public awareness and interest in a film or artist.

Democratizing this information would allow future analysts, critics, and fans to have a clearer understanding of how the Academy Awards influence the entertainment industry. Making data readily accessible can foster a more inclusive and well-informed public discourse around the significance of these awards. This democratization also underscores how the Academy Awards is not just about Hollywood’s glamour but also about recognizing and celebrating the hard work, creativity, and talent of artists.

Back to Table of Contents

Scope and Limitations


The study's analysis focuses on the winners of The 95th Academy Awards (Oscars 2023); hence, the pageview data used spans 2022 to 2023. This span is considered relevant for the event since most of the films included were released in 2022 and the awards concluded in the first quarter of 2023.

Pageview data will be analyzed for a period before, during, and after the announcement of nominations and winners to identify trends and patterns.

The study will be limited based on the following:

  1. Due to differences in titles across the available languages, only the English Wikipedia will be considered in the analysis.
  2. Titles of Oscars 2023-specific pages will be manually retrieved from the Oscars 2023 Wikipedia Page. As page titles change over time, the filtered pages will be limited based on the date of retrieval.
  3. The study may be limited based on the successfully read files from the Jojie Public Dataset.
  4. The study will be limited only to Oscars 2023. Findings and insights generated from this study may not be applicable directly for the other years.
  5. Other external factors such as concurrent awards, news coverage, and marketing campaigns will not be considered in this study but may have an effect on the result.

Back to Table of Contents

Data Source


The pageview complete dumps can be found on Wikimedia’s dumps page maintained by its analytics team. It’s a comprehensive timeseries of pageview data on a per-article basis of Wikimedia projects such as the English Wikipedia, Wikibooks, and many others. Dumps from December 2007 up to the present are formatted similarly and compressed into a bzip (.bz2) file.

Each dump file contains multiple lines of text, each line carrying the dataset's features:

  • Wiki code: The code identifying the specific wiki project.
  • Article title: The name of the article viewed.
  • Page id: A unique identifier for each article.
  • Mode: This denotes the platform through which the page was viewed, such as desktop or mobile.
  • Daily total: The total number of views the article received in one day.
  • Hourly counts: The number of views per hour within a day.

Furthermore, the hourly counts field is formatted as a string of letter-number pairs which can be deciphered as follows:

Letter A B C D E F G H I J K L M N O P Q R S T U V W X
Equivalent Hour 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

while the number following each letter corresponds to the pageviews for that hour.
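For illustration, this encoding can be decoded with a short helper (a sketch, not part of the project pipeline; the sample string is hypothetical):

```python
import re

def decode_hourly(counts: str) -> dict:
    """Map each letter (A = hour 0 ... X = hour 23) to its view count."""
    return {ord(letter) - ord('A'): int(num)
            for letter, num in re.findall(r'([A-X])(\d+)', counts)}

# 'B3C5D3' means 3 views at hour 1, 5 at hour 2, and 3 at hour 3
decode_hourly('B3C5D3')
```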

For this project, pageviews dumps from 2022 to 2023 will be used considering the timeline of The 95th Academy Awards. The total size of the files to be loaded is shown below.

In [2]:
!du -sch /mnt/data/public/wikipedia/pageviews/pageview_complete/202[23]
368G	/mnt/data/public/wikipedia/pageviews/pageview_complete/2022
311G	/mnt/data/public/wikipedia/pageviews/pageview_complete/2023
678G	total

Back to Table of Contents

Methodology


Figure 1 shows the steps done in this study:

Figure 1. Methodology Overview

Table 1 describes each step shown in the figure above.
Table 1. Methodology Steps and Description
Step Process Description
1 Initial Data Preprocessing Convert Wikipedia pageview data to parquet
2 Data Exploration on the Wikipedia Pageviews Dataset Explore the Wikipedia Pageviews dataset
3 Focused Data Preprocessing Filter and Convert Oscars-specific pageview data to parquet
4 Data Analysis Analyze the Pageviews data for Oscars-specific pages

Back to Table of Contents

Initial Data Preprocessing


Steps done within this section were executed using the notebook Process Wikipedia Pageviews.ipynb. This notebook should be run from top to bottom before proceeding.

Wikipedia pageview data are stored as text files compressed in bz2 format. In this format, Spark reads the data line by line, with each line corresponding to one row.
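Each dump line is plain space-separated text inside the bz2 archive. As a minimal stdlib illustration (the row and page id below are made up), one line can be round-tripped through bz2 and split into its fields:

```python
import bz2

# Hypothetical single dump line: project, title, pageid, mode,
# daily total, and the encoded hourly counts
raw = b'en.wikipedia RRR 43981683 desktop 9 B3C5D1\n'

compressed = bz2.compress(raw)
fields = bz2.decompress(compressed).decode().split()
```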

For easier usage, each pageview dump file was loaded in a Spark session and converted to parquet. To avoid additional overhead, only the filename was appended to each row so that its date can be derived later. The parquet files are then stored in their specific folders.

In [3]:
import os
os.environ['PYARROW_IGNORE_TIMEZONE'] = '1'
import warnings
warnings.filterwarnings('ignore')
import pyspark.pandas as ps
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from project_functions import *

spark = (SparkSession
         .builder
         .master('local[*]')
         .config('spark.sql.execution.arrow.pyspark.enabled', 'true')
         .getOrCreate())
spark.sparkContext.setLogLevel('OFF')
In [4]:
sdf1 = spark.read.parquet("pageviews_parquet/*")
sdf2 = spark.read.parquet('pageviews_parquet_repeat/*')
sdf = sdf1.union(sdf2)
In [5]:
wiki = (sdf
 .select(
     F.col('_c0').alias('project'),
     F.col('_c1').alias('title'),
     F.col('_c2').alias('pageid'),
     F.col('_c3').alias('mode'),
     F.col('_c4').cast('int').alias('daily_count'),
     F.col('_c5').alias('hourly_counts'),
     F.to_date(F.regexp_substr(F.col('filename'),
                               F.lit(r"\b\d{8}\b")
                              ), 'yyyyMMdd').alias('date')
))
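The filename-to-date extraction in the cell above can be mirrored in plain Python (the filename below is hypothetical; the real paths come from the parquet conversion step):

```python
import re
from datetime import date, datetime

def date_from_filename(filename: str):
    """Extract the first standalone 8-digit run and parse it as yyyyMMdd."""
    match = re.search(r'\b\d{8}\b', filename)
    return datetime.strptime(match.group(), '%Y%m%d').date() if match else None

date_from_filename('pageviews-20230312-user.bz2')
```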

Once the parquet files have been written, they are loaded into a Spark session. A sample of the loaded parquet files for Wikipedia pageviews from 2022 to 2023 is shown below in Table 2.

Table 2. Sample Wikipedia Pageviews Data
In [6]:
pd.DataFrame(
    wiki.limit(5).collect(),
    columns=[
        "project",
        "title",
        "pageid",
        "mode",
        "daily_count",
        "hourly_counts",
        "date"
    ]
)
Out[6]:
project title pageid mode daily_count hourly_counts date
0 aa.wikibooks - null mobile-web 1 N1 2023-09-23
1 aa.wikibooks Main_Page null desktop 61 B3C5D3E6G2H3I3J2K1L5N4P2Q4R1S4T4U3V2W4 2023-09-23
2 aa.wikibooks Special:Log/Alkab null desktop 1 F1 2023-09-23
3 aa.wikibooks Template:Wikitext_talk_page_converted_to_Flow null desktop 1 A1 2023-09-23
4 aa.wikibooks User:CptViraj null desktop 1 A1 2023-09-23

Back to Table of Contents

Data Exploration on the Wikipedia Pageviews Dataset


The data types for each column of the Wikipedia Pageviews dataset are shown in Table 3.

Table 3. Data Types of the Wikipedia Pageviews Dataset
In [7]:
pd.DataFrame(wiki.dtypes,
            columns=['Column', 'Data Type'])
Out[7]:
Column Data Type
0 project string
1 title string
2 pageid string
3 mode string
4 daily_count int
5 hourly_counts string
6 date date

We first examine the total number of rows in the entire dataset prior to filtering.

In [8]:
print("The total number of rows in the dataset is ", wiki.count(), ".")
The total number of rows in the dataset is  27226845264 .

We also examine the total daily view count across the entire dataset.

In [9]:
print("The total views in the 2022 to 2023 Wikipedia dataset is:")
wiki.agg(F.sum('daily_count')).collect()
The total views in the 2022 to 2023 Wikipedia dataset is:
Out[9]:
[Row(sum(daily_count)=328436025442)]

Back to Table of Contents

Focused Data Preprocessing


Steps done within this section were executed using the notebook Process Oscars Parquet.ipynb. This notebook should be run from top to bottom before proceeding.

Based on the data exploration done, we then focus our study on Oscars 2023-specific pages. To do so, we limit the dataset to the English Wikipedia by filtering the project column to 'en.wikipedia' for easier analysis. We also filter the articles to be analyzed by retrieving the page titles of the winning entities of The 95th Academy Awards as shown in their Wikipedia article.

The filtered Wikipedia articles were then written to parquet again as a checkpoint for easier access. A sample of the filtered Oscars dataset can be found below in Table 4.

Table 4. Sample Oscars Pageview Data
In [10]:
oscars_daily = spark.read.parquet('oscars').toPandas()
oscars_daily
Out[10]:
title pageid mode daily_count hourly_counts date
0 Daniel_Barrett_(visual_effects_supervisor) 34496961 desktop 1.0 I1 2023-09-23
1 A24 38837739 mobile-web 2406.0 A97B105C121D120E86F84G74H67I64J60K54L75M73N83O... 2023-06-27
2 Black_Panther:_Wakanda_Forever null mobile-app 418.0 A19B26C31D20E16F16G20H15I26J16K11L16M16N13O15P... 2023-06-13
3 Avatar:_The_Way_of_Water 25813358 desktop 4160.0 A157B136C155D164E149F146G153H156I123J137K141L1... 2023-05-21
4 A24 38837739 desktop 1900.0 A64B66C62D77E65F85G76H68I53J72K52L42M58N81O130... 2023-08-22
... ... ... ... ... ... ...
67650 A24 38837739 mobile-web 3199.0 A130B157C164D162E124F142G85H104I100J67K80L93M1... 2022-06-04
67651 Brendan_Fraser 386491 desktop 1426.0 A57B63C38D49E46F37G48H49I52J68K72L77M61N58O67P... 2022-06-04
67652 Chandrabose_(lyricist) 8390040 mobile-web 60.0 C3D2E4F4G4H3I2J1L3M3O1P4Q9R4S7T3U1V2 2022-07-09
67653 A24 472347 mobile-web NaN None 2022-02-02
67654 Guillermo_del_Toro's_Pinocchio 62106165 desktop NaN None 2022-02-02

67655 rows × 6 columns

As a sanity check, duplicates were inspected and none were found, meaning each row in the dataset is distinct. Table 5 shows that the number of rows remaining after drop_duplicates equals the original number of rows.

Table 5. Length of dataset after dropping duplicates
In [11]:
pd.DataFrame([['oscars_daily.drop_duplicates()',
               len(oscars_daily.drop_duplicates())]],
             columns=['Action', 'Result'])
Out[11]:
Action Result
0 oscars_daily.drop_duplicates() 67655

Table 6 shows that null data is present in the daily_count and hourly_counts columns. This may be due to errors in the reading process.

Table 6. Oscars Data Information
In [12]:
df_info_to_dataframe(oscars_daily)
Out[12]:
Column Non-Null Count Dtype
0 title 67655 non-null object
1 pageid 67655 non-null object
2 mode 67655 non-null object
3 daily_count 67510 non-null float64
4 hourly_counts 67434 non-null object
5 date 67655 non-null object

To understand the severity of the null data points, we check which dates are affected.

In [13]:
oscars_daily.iloc[
np.where(np.isnan(oscars_daily['daily_count']))[0]
].date.unique()
Out[13]:
array([datetime.date(2022, 2, 2), datetime.date(2022, 1, 26)],
      dtype=object)
In [14]:
oscars_daily[oscars_daily.hourly_counts.isna()].date.unique()
Out[14]:
array([datetime.date(2022, 1, 30), datetime.date(2022, 2, 2),
       datetime.date(2022, 1, 26)], dtype=object)

Three dates were affected, namely Jan 26, Jan 30, and Feb 2, 2022. All of these dates fall at the beginning of the dataset and are fairly remote from the dates significant to the Oscars 2023 event.

To fix this, nulls in the daily_count column will be filled with zeroes, as those dates do not affect the analysis. A separate DataFrame for daily data will be made that excludes the hourly_counts column. For hourly_counts, another DataFrame will be made focusing on that column; its null values will be dropped rather than imputed from daily_count, since the hourly distribution is unknown. In doing so, the distribution of counts per hour is not distorted when analyzed.

In [15]:
oscars_daily_clean = (oscars_daily
                      .drop(columns='hourly_counts')
                      .fillna(0))
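The same strategy, including the hourly side, can be shown on a toy frame (the rows are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'title': ['A24', 'RRR', 'RRR'],
                    'daily_count': [5.0, np.nan, 7.0],
                    'hourly_counts': ['A5', None, 'B3C4']})

# Daily view: drop hourly_counts, fill missing daily_count with 0
toy_daily = toy.drop(columns='hourly_counts').fillna(0)

# Hourly view: keep only rows that have an hourly breakdown
toy_hourly = toy.dropna(subset=['hourly_counts'])
```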

Additional columns will also be added to the dataset. The date column will be converted to datetime, from which the month, year, and day_of_week columns will be derived.

Entities denoted by each page title will also be categorized into Movies, Actors, Production, and Others, which will be included in the analysis. Each of these columns has a boolean data type, and a True value indicates that the entity belongs to that category.
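As a toy illustration of the mapping-plus-dummies step (using a few titles from the mapping below):

```python
import pandas as pd

toy = pd.DataFrame({'title': ['RRR', 'Michelle_Yeoh', 'A24']})
mapping = {'RRR': 'Movies', 'Michelle_Yeoh': 'Actors', 'A24': 'Production'}

# Map each title to its category, then expand to one boolean column per category
toy['category'] = toy['title'].map(mapping)
toy = toy.join(pd.get_dummies(toy['category']))
```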

A sample of the dataset that will be used for analysis can be found below in Table X.

Table X. Sample of the Dataset used for Analysis
In [16]:
oscars_daily_clean['date'] = pd.to_datetime(oscars_daily_clean.date)
oscars_daily_clean['year'] = oscars_daily_clean.date.dt.year
oscars_daily_clean['month'] = oscars_daily_clean.date.dt.month
oscars_daily_clean['day_of_week'] = oscars_daily_clean.date.dt.dayofweek

category_mapping = {
    "95th_Academy_Awards": "Others",
    "Everything_Everywhere_All_at_Once": "Movies",
    "Michelle_Yeoh": "Actors",
    "Ke_Huy_Quan": "Actors",
    "Jamie_Lee_Curtis": "Actors",
    "A24": "Production",
    "All_Quiet_on_the_Western_Front_(2022_film)": "Movies",
    "The_Whale_(2022_film)": "Movies",
    "Avatar:_The_Way_of_Water": "Movies",
    "Black_Panther:_Wakanda_Forever": "Movies",
    "The_Boy,_the_Mole,_the_Fox_and_the_Horse_(film)": "Movies",
    "The_Elephant_Whisperers": "Movies",
    "Guillermo_del_Toro's_Pinocchio": "Movies",
    "An_Irish_Goodbye": "Movies",
    "Navalny_(film)": "Movies",
    "RRR": "Movies",
    "Top_Gun:_Maverick": "Movies",
    "Women_Talking_(film)": "Movies",
    "Daniels_(directors)": "Production",
    "Brendan_Fraser": "Actors",
    "Hauschka": "Production",
    "Charlie_Mackesy": "Production",
    "M._M._Keeravani": "Production",
    "Sarah_Polley": "Production",
    "Miriam_Toews": "Production",
    "Guillermo_del_Toro": "Production",
    "Mark_Gustafson": "Production",
    "Alex_Bulkley": "Production",
    "Edward_Berger": "Production",
    "Daniel_Roher": "Production",
    "Odessa_Rae": "Production",
    "Shane_Boris": "Production",
    "Kartiki_Gonsalves": "Production",
    "Guneet_Monga": "Production",
    "Chandrabose_(lyricist)": "Production",
    "James_Mather_(sound_editor)": "Production",
    "Al_Nelson_(sound_engineer)": "Production",
    "Chris_Burdon": "Production",
    "Christian_M._Goldbeck": "Production",
    "Ernestine_Hipper": "Production",
    "James_Friend": "Production",
    "Adrien_Morot": "Production",
    "Judy_Chin": "Production",
    "Annemarie_Bradley": "Production",
    "Ruth_E._Carter": "Production",
    "Paul_Rogers_(film_editor)": "Production",
    "Joe_Letteri": "Production",
    "Richard_Baneham": "Production",
    "Eric_Saindon": "Production",
    "Daniel_Barrett_(visual_effects_supervisor)": "Production"
}

# Add a new "category" column to the DataFrame based on the mapping
# Make sure the item column in df_eda matches the keys in category_mapping
oscars_daily_clean["category"] = (oscars_daily_clean["title"]
                                  .map(category_mapping))
oscars_daily_clean = pd.merge(oscars_daily_clean,
                              pd.get_dummies(oscars_daily_clean.category),
                              left_index=True, right_index=True)

oscars_daily_clean
Out[16]:
title pageid mode daily_count date year month day_of_week category Actors Movies Others Production
0 Daniel_Barrett_(visual_effects_supervisor) 34496961 desktop 1.0 2023-09-23 2023 9 5 Production False False False True
1 A24 38837739 mobile-web 2406.0 2023-06-27 2023 6 1 Production False False False True
2 Black_Panther:_Wakanda_Forever null mobile-app 418.0 2023-06-13 2023 6 1 Movies False True False False
3 Avatar:_The_Way_of_Water 25813358 desktop 4160.0 2023-05-21 2023 5 6 Movies False True False False
4 A24 38837739 desktop 1900.0 2023-08-22 2023 8 1 Production False False False True
... ... ... ... ... ... ... ... ... ... ... ... ... ...
67650 A24 38837739 mobile-web 3199.0 2022-06-04 2022 6 5 Production False False False True
67651 Brendan_Fraser 386491 desktop 1426.0 2022-06-04 2022 6 5 Actors True False False False
67652 Chandrabose_(lyricist) 8390040 mobile-web 60.0 2022-07-09 2022 7 5 Production False False False True
67653 A24 472347 mobile-web 0.0 2022-02-02 2022 2 2 Production False False False True
67654 Guillermo_del_Toro's_Pinocchio 62106165 desktop 0.0 2022-02-02 2022 2 2 Movies False True False False

67655 rows × 13 columns

With the preprocessing done, we get the total daily count and the count of unique titles per year-month to identify the months that lack data (see Table 7).

Table 7. Total Daily Count and Count of Unique Titles per Year-Month
In [17]:
oscars_daily_clean.groupby(['year', 'month']).agg({'title':'nunique',
                                                   'daily_count': 'sum'})
Out[17]:
title daily_count
year month
2022 1 33 1458271.0
2 33 876168.0
3 34 1996419.0
4 33 2087088.0
5 35 4129596.0
6 34 3621363.0
7 36 3131939.0
8 36 3505628.0
9 37 7708977.0
10 36 4814380.0
11 40 11675821.0
12 39 17995389.0
2023 1 40 19711419.0
2 40 11565567.0
3 50 31794293.0
4 50 5805783.0
5 49 4024597.0
6 49 4554705.0
7 49 4157075.0
8 48 3444620.0
9 49 2905820.0
10 49 3001665.0
11 49 2861168.0
12 49 3475458.0

Aside from reading errors, some pages may have been created within the selected span of dates, or their titles (which we used to filter the dataset) may have changed over time. These are some reasons why there are fewer unique page titles in 2022 than in 2023. However, for the awarding of The 95th Academy Awards, held in March 2023, more data points were available, so the analysis will still proceed.

Back to Table of Contents

Data Analysis


To ensure that no changes affect the original DataFrame, a copy of oscars_daily_clean is stored in df_daily_eda. This serves as the DataFrame used throughout the analysis section.

In [18]:
df_daily_eda = oscars_daily_clean.copy()

The new DataFrame is then grouped by title and the remaining descriptor columns, aggregating the view counts. This removes the repetition of titles across view modes and page IDs and instead sums the total view count per title.

In [19]:
# List of columns to exclude
exclude_columns = ['pageid', 'mode', 'daily_count']

# Create a list of column titles, excluding the specified columns
titles = [x for x in df_daily_eda.columns if x not in exclude_columns]
df_daily_eda = (
    df_daily_eda
    .groupby(titles, as_index=False)
    .agg({"daily_count": "sum"})
)

With the data more focused, it was examined through plots, with Plotly as the primary tool given its interactive nature. To start, the data distribution was examined via a bar plot, as seen below in Fig. 1.

Fig. 1. Total Dates Covered per Title
In [20]:
# Aggregating data for the "title" column
title_counts = df_daily_eda['title'].value_counts().reset_index()
title_counts.columns = ['title', 'count']

# Aggregating data for the "category" column
category_counts = df_daily_eda['category'].value_counts().reset_index()
category_counts.columns = ['category', 'count']

# Plot the Title distribution
title_fig = px.bar(title_counts,
                   x='title',
                   y='count', 
                   title='Title Distribution', 
                   labels={'count': 'Frequency'})

title_fig.update_layout(xaxis={'categoryorder': 'total descending'},
                       yaxis={'automargin': True,
                              'range': [0, title_counts['count'].max() + 200]})

# Display plots
title_fig.show()

The plot above shows the distribution of the titles. Ideally, there should be around 638 instances per title; however, the shape and actual counts show that this is not the case for all. The floor is Annemarie_Bradley, with only around 8 instances.

Next, a box plot was made to check the distribution of views across the titles as seen in Fig. 2 below.

Fig. 2. Distribution of Pageviews per Title
In [21]:
# Create the box plot of daily views per title
fig_box = px.box(df_daily_eda, x="title", y="daily_count", notched=False)

fig_box.update_layout(
    title='Distribution of Views per Title',
    xaxis_title='Title',
    yaxis_title='Views',
    xaxis={
        'categoryorder': 'total descending',
        'tickangle': 45,  # Rotate labels by 45 degrees
        'tickfont': {'size': 10}  # Set font size to 10
    },
    yaxis_type="log",
    showlegend=False,
    height=1200,
    width=800,
    margin=dict(b=200), 
)

# Display the updated figure
fig_box.show()

Due to the wide range of view counts, the y-axis was plotted on a logarithmic scale for readability. Plotly's interactivity allowed examination of the percentiles and median of each title's views.

For example, Avatar:_The_Way_of_Water, which has the highest daily views overall, has a maximum daily view count of around 792.89k and a median of 16.502k. This gap suggests there were periods where views spiked sharply. The same is suggested by the other box plots, especially those on the left side, which tend to belong to the Actors or Movies categories.

Next we check for any correlations between categories, as seen in Fig. 3 below.

Fig. 3. Correlation Matrix among Features
In [22]:
# Select only the numeric columns of df_daily_eda for correlation calculation
numeric_df = df_daily_eda[["Actors", "Movies", "Production", "Others",
                           "month", "day_of_week", "daily_count", "year"]]

# Now calculate correlation on numeric DataFrame
correlation_matrix = numeric_df.corr()

plt.figure(figsize=(12, 8))

# Use seaborn to create a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='viridis', fmt=".2f",
           linewidths=.5, cbar_kws={"shrink": .8})

plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

plt.title('Correlation Matrix')
plt.show()

From the outset, it was seen that the Production category was strongly negatively correlated (-0.75) with the Movies category. Since each title belongs to exactly one category, these boolean columns are mutually exclusive, so the pageview data of one tends to exclude the other.

There are some other notable relationships. The first is between Movies and Actors, with a weak negative correlation (-0.20), suggesting that higher values of one coincide with lower values of the other.

In terms of positive relationships, both Movies and Actors had slightly positive correlations with daily_count, suggesting that pages in these categories tend to receive more views overall.
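Returning to the negative correlations above: one-hot columns derived from a single categorical are mutually exclusive, so any two non-constant columns are negatively correlated by construction. This can be verified on illustrative counts (the numbers below are made up):

```python
import pandas as pd

# A made-up category assignment; each row belongs to exactly one category
cats = pd.Series(['Movies'] * 18 + ['Production'] * 30 + ['Actors'] * 4)
dummies = pd.get_dummies(cats).astype(int)

# Mutually exclusive indicators always correlate negatively
corr = dummies.corr()
```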

Given the volume of data and its temporal nature, the next step was to examine the trend of daily_count, the number of views per page. To this end, a timeseries graph (Fig. 4) was created using Plotly's line plot.

Fig. 4. Timeseries Plot for Daily Views per Page
In [23]:
actors = np.ravel(
    [x for x in df_daily_eda.groupby('title')['Actors'].unique()]
)
movies = np.ravel(
    [x for x in df_daily_eda.groupby('title')['Movies'].unique()]
)
production = np.ravel(
    [x for x in df_daily_eda.groupby('title')['Production'].unique()]
)
others = np.ravel(
    [x for x in df_daily_eda.groupby('title')['Others'].unique()]
)
In [24]:
df_daily_eda = df_daily_eda.sort_values(by='date') 
In [25]:
# Assuming df_daily_eda is your DataFrame
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])

# Group by day
df_daily = df_daily_eda.groupby(
    ['date', 'title', 'category']
)['daily_count'].sum().reset_index()

# Group by month
df_daily_eda['month'] = df_daily_eda['date'].dt.to_period('M')
df_monthly = df_daily_eda.groupby(
    ['month', 'title', 'category']
)['daily_count'].sum().reset_index()
df_monthly['month'] = df_monthly['month'].dt.to_timestamp()

# Create a figure
fig_line = go.Figure()

# Get the unique categories
categories = df_daily['category'].unique()

# Sort titles alphabetically
titles = sorted(df_daily['title'].unique())

# Add daily traces for each unique title
for title in titles:
    title_data = df_daily[df_daily['title'] == title]
    fig_line.add_trace(go.Scatter(
        x=title_data['date'],
        y=title_data['daily_count'],
        mode='lines+markers',
        name=title,
        visible=True,  # Initially, make daily traces visible
        # legendgroup='daily',
        hovertemplate='Date: %{x}<br>Views: %{y}<br>Title: %{text}',
        text=[title] * len(title_data)
    ))

# Add monthly traces for each unique title (initially hidden)
for title in titles:
    title_data = df_monthly[df_monthly['title'] == title]
    fig_line.add_trace(go.Scatter(
        x=title_data['month'],
        y=title_data['daily_count'],
        mode='lines+markers',
        name=title,
        visible=False,  # Initially, make monthly traces hidden
        # legendgroup='monthly',
        hovertemplate='Month: %{x}<br>Views: %{y}<br>Title: %{text}',
        text=[title] * len(title_data)
    ))

# Create buttons for dropdown menu to filter by category
buttons = []
for category in categories:
    category_titles = sorted(
        df_daily[df_daily['category'] == category]['title'].unique()
    )
    visible_daily = ([title in category_titles for title in titles] +
                     [False] * len(titles))
    visible_monthly = ([False] * len(titles) +
                       [title in category_titles for title in titles])
    buttons.append(dict(
        label=f'{category} (Daily)',
        method='update',
        args=[{'visible': visible_daily},
              {'title': (f'Views of Titles Over Time - Category: {category}'
                         ' (Daily)')}]
    ))
    buttons.append(dict(
        label=f'{category} (Monthly)',
        method='update',
        args=[{'visible': visible_monthly},
              {'title': (f'Views of Titles Over Time - Category: {category}'
                         ' (Monthly)')}]
    ))

# Add a button to show all titles (daily)
buttons.append(dict(
    label='All (Daily)',
    method='update',
    args=[{'visible': [True] * len(titles) + [False] * len(titles)},
          {'title': 'Views of Titles Over Time - All Categories (Daily)'}]
))

# Add a button to show all titles (monthly)
buttons.append(dict(
    label='All (Monthly)',
    method='update',
    args=[{'visible': [False] * len(titles) + [True] * len(titles)},
          {'title': 'Views of Titles Over Time - All Categories (Monthly)'}]
))

# Update layout with dropdown menu
fig_line.update_layout(
    title='Views of Titles Over Time - All Categories (Daily)',
    xaxis_title='Date',
    yaxis_title='Views',
    # yaxis=dict(type='log'),  #comment out for viewing in normal scale
    updatemenus=[dict(
        active=0,
        buttons=buttons,
        x=1.15,
        y=1.15
    )],
    template='plotly_white',
    legend=dict(traceorder='normal')  # Ensure the legend is sorted
)

# Show the figure
fig_line.show()

Exploration of the above graph yielded several observations.

  • Movies and Actors were the most significantly affected by the spike of pageviews on the day of the Oscars in March, but there was a spike across all titles.
  • Some titles in Production only made their appearance in March 2023, during the Oscars period.
  • Avatar: The Way of Water had the highest views in a single day, on Dec 1, 2022.
  • Everything Everywhere All at Once had the highest views during the Oscars month of March 2023.
  • The Whale had slightly lower but similar levels of engagement during the Oscars and during its premiere the previous year, on September 5, 2022.

A log-scaled version of Fig. 4 is shown below (Fig. 5).

Fig. 5. Timeseries Plot of Daily Views per Page (Log-scale)
In [26]:
# Update layout with dropdown menu
fig_line.update_layout(
    title='Views of Titles Over Time - All Categories (Daily)',
    xaxis_title='Date',
    yaxis_title='Views',
    yaxis=dict(type='log'),  #comment out for viewing in normal scale
    updatemenus=[dict(
        active=0,
        buttons=buttons,
        x=1.15,
        y=1.15
    )],
    template='plotly_white',
    legend=dict(traceorder='normal')  # Ensure the legend is sorted
)

# Show the figure
fig_line.show()

The modular graph above allowed a thorough examination and comparison of the trends of pageviews between and within categories.

Fig. 6. Top Titles for 2022 and 2023
In [27]:
# Find the top titles by daily_count for each year
top_titles_hist = df_daily_eda.sort_values(
    by="daily_count", ascending=False
).groupby("year").apply(lambda x: x.nlargest(10, 'daily_count'))

# Create a Plotly histogram
fig_hist = px.histogram(
    top_titles_hist, x="year", y="daily_count", color="title",
    title="Top Titles by Daily Count Each Year",
    labels={"daily_count": "Daily Count", "title": "Title"}, barmode="group")

# Show the plot
fig_hist.show()

The bar graphs in Fig. 6 support the observation that the pageview spikes concern the movies and actors much more than the production staff.

Fig. 7. Timeseries Graph of The Whale-related pages
In [28]:
# Assuming df_daily_eda is your DataFrame
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])

# Filter the data for "The Whale" group
whale_titles = ['Adrien_Morot', 'Judy_Chin', 'Annemarie_Bradley',
                'Brendan_Fraser', 'The_Whale_(2022_film)']
# .copy() avoids a SettingWithCopyWarning when adding 'month' below
df_whale = df_daily_eda[df_daily_eda['title'].isin(whale_titles)].copy()

# Group by day
df_daily_whale = df_whale.groupby(['date', 'title', 'category']
                                 )['daily_count'].sum().reset_index()

# Group by month
df_whale['month'] = df_whale['date'].dt.to_period('M')
df_monthly_whale = df_whale.groupby(['month', 'title', 'category']
                                   )['daily_count'].sum().reset_index()
df_monthly_whale['month'] = df_monthly_whale['month'].dt.to_timestamp()

# Create a figure
fig_whale = go.Figure()

# Sort titles alphabetically
titles_whale = sorted(df_daily_whale['title'].unique())

# Add daily traces for each unique title in "The Whale"
for title in titles_whale:
    title_data = df_daily_whale[df_daily_whale['title'] == title]
    fig_whale.add_trace(go.Scatter(
        x=title_data['date'],
        y=title_data['daily_count'],
        mode='lines+markers',
        name=title,
        visible=True,  # Initially, make daily traces visible
        hovertemplate='Date: %{x}<br>Views: %{y}<br>Title: %{text}',
        text=[title] * len(title_data)
    ))

# Add monthly traces for each unique title in "The Whale" (initially hidden)
for title in titles_whale:
    title_data = df_monthly_whale[df_monthly_whale['title'] == title]
    fig_whale.add_trace(go.Scatter(
        x=title_data['month'],
        y=title_data['daily_count'],
        mode='lines+markers',
        name=title,
        visible=False,  # Initially, make monthly traces hidden
        hovertemplate='Month: %{x}<br>Views: %{y}<br>Title: %{text}',
        text=[title] * len(title_data)
    ))

# Create buttons for dropdown menu to filter by daily and monthly data
buttons_whale = [
    dict(
        label='Daily',
        method='update',
        args=[{'visible': ([True] * len(titles_whale) + [False] *
                           len(titles_whale))},
              {'title': 'Views of Titles Over Time - The Whale (Daily)',
               'yaxis': {'type': 'log'}}]
    ),
    dict(
        label='Monthly',
        method='update',
        args=[{'visible': ([False] * len(titles_whale) + [True] *
                           len(titles_whale))},
              {'title': 'Views of Titles Over Time - The Whale (Monthly)',
               'yaxis': {'type': 'linear'}}]
    )
]

# Update layout with dropdown menu
fig_whale.update_layout(
    title='Views of Titles Over Time - The Whale (Daily)',
    xaxis_title='Date',
    yaxis_title='Views',
    yaxis=dict(type='log'),  # Set initial y-axis type to log for daily data
    updatemenus=[dict(
        active=0,
        buttons=buttons_whale,
        x=1.15,
        y=1.15
    )],
    template='plotly_white',
    legend=dict(traceorder='normal')  # Ensure the legend is sorted
)

# Show the figure
fig_whale.show()
Fig. 8. Timeseries Graph of Everything Everywhere All at Once-related pages
In [29]:
import pandas as pd
import plotly.graph_objects as go

# Assuming df_daily_eda is your DataFrame
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])

# Filter the data for "Everything Everywhere All at Once" group
eeaao_titles = ["Ke_Huy_Quan", "Michelle_Yeoh", "Jamie_Lee_Curtis",
                "Everything_Everywhere_All_at_Once", "Paul_Rogers"]
# .copy() avoids a SettingWithCopyWarning when adding 'month' below
df_eeaao = df_daily_eda[df_daily_eda['title'].isin(eeaao_titles)].copy()

# Group by day
df_daily_eeaao = df_eeaao.groupby(['date', 'title', 'category']
                                 )['daily_count'].sum().reset_index()

# Group by month
df_eeaao['month'] = df_eeaao['date'].dt.to_period('M')
df_monthly_eeaao = df_eeaao.groupby(['month', 'title', 'category']
                                   )['daily_count'].sum().reset_index()
df_monthly_eeaao['month'] = df_monthly_eeaao['month'].dt.to_timestamp()

# Create a figure
fig_eeaao = go.Figure()

# Sort titles alphabetically
titles_eeaao = sorted(df_daily_eeaao['title'].unique())

# Add daily traces for each unique title in "Everything Everywhere All at Once"
for title in titles_eeaao:
    title_data = df_daily_eeaao[df_daily_eeaao['title'] == title]
    fig_eeaao.add_trace(go.Scatter(
        x=title_data['date'],
        y=title_data['daily_count'],
        mode='lines+markers',
        name=title,
        visible=True,  # Initially, make daily traces visible
        hovertemplate='Date: %{x}<br>Views: %{y}<br>Title: %{text}',
        text=[title] * len(title_data)
    ))


for title in titles_eeaao:
    title_data = df_monthly_eeaao[df_monthly_eeaao['title'] == title]
    fig_eeaao.add_trace(go.Scatter(
        x=title_data['month'],
        y=title_data['daily_count'],
        mode='lines+markers',
        name=title,
        visible=False,  # Initially, make monthly traces hidden
        hovertemplate='Month: %{x}<br>Views: %{y}<br>Title: %{text}',
        text=[title] * len(title_data)
    ))

# Create buttons for dropdown menu to filter by daily and monthly data
buttons_eeaao = [
    dict(
        label='Daily',
        method='update',
        args=[{'visible': ([True] * len(titles_eeaao) + [False] *
                           len(titles_eeaao))},
              {'title': ('Views of Titles Over Time - Everything Everywhere '
                         'All at Once (Daily)'), 'yaxis': {'type': 'log'}}]
    ),
    dict(
        label='Monthly',
        method='update',
        args=[{'visible': ([False] * len(titles_eeaao) + [True] *
                           len(titles_eeaao))},
              {'title': ('Views of Titles Over Time - Everything Everywhere '
                         'All at Once (Monthly)'), 'yaxis': {'type': 'linear'}}
             ]
    )
]

# Update layout with dropdown menu
fig_eeaao.update_layout(
    title=('Views of Titles Over Time - Everything Everywhere All '
           'at Once (Daily)'),
    xaxis_title='Date',
    yaxis_title='Views',
    yaxis=dict(type='log'),  # Set initial y-axis type to log for daily data
    updatemenus=[dict(
        active=0,
        buttons=buttons_eeaao,
        x=1.15,
        y=1.15
    )],
    template='plotly_white',
    legend=dict(traceorder='normal')  # Ensure the legend is sorted
)

# Show the figure
fig_eeaao.show()
Table 8. Difference between Average Daily Views per page Before and After the Oscars
In [30]:
# Assuming df_daily_eda is your DataFrame
df_daily_eda['date'] = pd.to_datetime(df_daily_eda['date'])

# Define the cutoff date
date_cutoff = pd.Timestamp('2023-03-13')

# Define the date range for 7 days before and 7 days after
date_start_before = date_cutoff - pd.Timedelta(days=7)
date_end_after = date_cutoff + pd.Timedelta(days=7)

# Filter data for 7 days before and 7 days after March 13, 2023
df_before = df_daily_eda[(df_daily_eda['date'] >= date_start_before) &
                         (df_daily_eda['date'] < date_cutoff)]
df_after = df_daily_eda[(df_daily_eda['date'] >= date_cutoff) &
                        (df_daily_eda['date'] <= date_end_after)]

# Calculate average daily_count for each title 7 days before March 13, 2023
avg_daily_count_before = df_before.groupby('title'
                                          )['daily_count'].mean().reset_index()
avg_daily_count_before.columns = ['title', 'avg_daily_count_before']

# Calculate average daily_count for each title 7 days after March 13, 2023
avg_daily_count_after = df_after.groupby('title'
                                        )['daily_count'].mean().reset_index()
avg_daily_count_after.columns = ['title', 'avg_daily_count_after']

# Merge the results to have a single DataFrame
avg_daily_count = pd.merge(avg_daily_count_before, avg_daily_count_after,
                           on='title', how='outer')

# Add period columns
avg_daily_count_before['period'] = '7 Days Before March 13, 2023'
avg_daily_count_after['period'] = '7 Days After March 13, 2023'

avg_daily_count_before.columns = ['title', 'avg_daily_count', 'period']
avg_daily_count_after.columns = ['title', 'avg_daily_count', 'period']

# Combine into a single DataFrame
avg_daily_count_period = pd.concat([avg_daily_count_before,
                                    avg_daily_count_after
                                   ],
                                   axis=0).reset_index(drop=True)

# Sort by title for better readability
avg_daily_count_period = avg_daily_count_period.sort_values(
    by='title').reset_index(drop=True)

# Set display option to show all rows
pd.set_option('display.max_rows', None)
avg_diff = avg_daily_count_period.pivot(index="title", columns="period",
                                        values="avg_daily_count"
                                       ).sort_index(axis=1,
                                                    ascending=False).dropna()
avg_diff['Difference'] = (avg_diff["7 Days After March 13, 2023"] -
                          avg_diff["7 Days Before March 13, 2023"])
avg_diff
Out[30]:
period 7 Days Before March 13, 2023 7 Days After March 13, 2023 Difference
title
95th_Academy_Awards 65819.285714 267900.375 202081.089286
A24 5447.857143 31686.375 26238.517857
Adrien_Morot 69.000000 1201.875 1132.875000
Alex_Bulkley 36.285714 212.750 176.464286
All_Quiet_on_the_Western_Front_(2022_film) 29047.000000 70116.500 41069.500000
An_Irish_Goodbye 9.142857 8799.125 8789.982143
Avatar:_The_Way_of_Water 40677.714286 42877.875 2200.160714
Black_Panther:_Wakanda_Forever 19906.142857 22201.375 2295.232143
Brendan_Fraser 44603.142857 324174.625 279571.482143
Chandrabose_(lyricist) 444.285714 9161.375 8717.089286
Charlie_Mackesy 496.142857 2746.250 2250.107143
Chris_Burdon 9.000000 143.000 134.000000
Daniel_Barrett_(visual_effects_supervisor) 15.428571 141.625 126.196429
Daniel_Roher 307.428571 3651.750 3344.321429
Daniels_(directors) 9489.571429 86606.875 77117.303571
Edward_Berger 1421.000000 4570.000 3149.000000
Eric_Saindon 27.714286 389.625 361.910714
Everything_Everywhere_All_at_Once 72574.428571 459437.750 386863.321429
Guillermo_del_Toro 5158.285714 20537.125 15378.839286
Guillermo_del_Toro's_Pinocchio 5644.428571 17002.500 11358.071429
Guneet_Monga 287.285714 21180.750 20893.464286
Hauschka 513.142857 3936.875 3423.732143
James_Mather_(sound_editor) 10.142857 117.500 107.357143
Jamie_Lee_Curtis 32897.000000 238968.125 206071.125000
Joe_Letteri 60.142857 607.875 547.732143
Kartiki_Gonsalves 156.714286 22581.000 22424.285714
Ke_Huy_Quan 22609.857143 319866.875 297257.017857
M._M._Keeravani 2714.714286 35578.250 32863.535714
Mark_Gustafson 3.500000 5.625 2.125000
Michelle_Yeoh 38678.571429 314330.875 275652.303571
Miriam_Toews 2691.428571 3095.000 403.571429
Navalny_(film) 2146.714286 17584.625 15437.910714
Odessa_Rae 77.000000 748.125 671.125000
RRR 122.142857 339.500 217.357143
Richard_Baneham 87.285714 970.125 882.839286
Ruth_E._Carter 383.000000 10038.500 9655.500000
Sarah_Polley 9484.714286 39948.125 30463.410714
Shane_Boris 64.428571 355.125 290.696429
The_Boy,_the_Mole,_the_Fox_and_the_Horse_(film) 2076.428571 7927.375 5850.946429
The_Elephant_Whisperers 1385.428571 62919.875 61534.446429
The_Whale_(2022_film) 43692.857143 200155.125 156462.267857
Top_Gun:_Maverick 20000.714286 26370.000 6369.285714
Women_Talking_(film) 34357.000000 32671.250 -1685.750000
Table 9. Summary Statistics of the Difference of Average Daily Views per page Before and After the Oscars
In [31]:
avg_diff.describe()
Out[31]:
period 7 Days Before March 13, 2023 7 Days After March 13, 2023 Difference
count 43.000000 43.000000 43.000000
mean 11993.104651 63578.029070 51584.924419
std 19007.590382 113297.582174 97780.624236
min 3.500000 5.625000 -1685.750000
25% 82.142857 1086.000000 609.428571
50% 1421.000000 17002.500000 6369.285714
75% 19953.428571 41413.000000 31663.473214
max 72574.428571 459437.750000 386863.321429

Back to Table of Contents

To effectively answer the question, "Do the Oscars have an effect on Wikipedia pageviews?", we want to find how different the pageviews on the days of the Nominations and Awarding are compared to the rest of their respective months. Calculating the z-score of those days against the distribution of pageviews over the rest of the month enables us to answer this statistically.
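The helper `calculate_date_zscore` used in the next cell is defined earlier in the notebook. A minimal stand-in consistent with the description above (the event day's views scored against the distribution of the other days in the same month; not necessarily identical to the notebook's definition) might look like:

```python
import pandas as pd

def calculate_date_zscore(df, title, event_date):
    """Z-score of a title's views on event_date relative to the
    rest of that month's daily views (event day excluded)."""
    month_mask = ((df['title'] == title) &
                  (df['date'].dt.year == event_date.year) &
                  (df['date'].dt.month == event_date.month))
    month_views = df.loc[month_mask]
    event_views = month_views.loc[month_views['date'] == event_date,
                                  'daily_count']
    if event_views.empty:
        return None  # no data for that title on that day
    rest = month_views.loc[month_views['date'] != event_date, 'daily_count']
    return (event_views.iloc[0] - rest.mean()) / rest.std()
```

Excluding the event day from the baseline keeps a huge spike from inflating the very mean and standard deviation it is being compared against.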

In [32]:
nomination_date = pd.to_datetime('2023-01-24', format='%Y-%m-%d')
awarding_date = pd.to_datetime('2023-03-13', format='%Y-%m-%d')

zscore_dict = {}
for title in category_mapping.keys():
    zscore_dict[title] = {}
    zscore_dict[title]['Nominations'] = calculate_date_zscore(
        df_daily_eda, title, nomination_date)
    zscore_dict[title]['Awarding'] = calculate_date_zscore(
        df_daily_eda, title, awarding_date)
    
z_df = pd.DataFrame(zscore_dict).T.dropna().sort_index()

Results and Discussion


Distribution Histogram

After engaging with the data through a thorough exploratory data analysis, several insights and observations were gained. Initially, a discrepancy was found in the distribution of the data (see Fig. 2). Aside from the discussed limitations, the lack of view history for a page suggested either a lack of traffic or the recency of the page's creation. As of the writing of this paper, one of the titles in the production category, Annemarie Bradley, only began to garner views on March 29, 2023, sixteen days after the Academy Awards. This may be the day their Wikipedia page was created following their win in their category, or simply when they first gained attention.

Box and Whiskers Plot

Due to the initial findings from the histogram, the next step was to check how the pageviews, or daily_count, are distributed across the titles. This resulted in the box and whiskers plot (see Fig. 2). Avatar: The Way of Water was at the top of the view counts overall, and the rest of the plot showed that view counts varied across titles, with clear spikes and greater outliers in terms of volume for the movies and actors. This heavily suggested that certain time periods, and the happenings in those periods, caused the pageviews of these titles to increase sharply.

Correlation Matrix

The next step was to examine the relationships among the different features. The titles themselves were not used; instead, the relationship between the categories and time was observed (see Fig. 3). As mentioned in the earlier section, there was a significantly negative correlation between the Movies category and the Production category. This was most likely due to the boolean nature of those columns: since there were more production-based Oscars to win than movie awards, datapoints were more often part of Production, making the negative relationship between the two categories more prominent.

A noteworthy observation from the analysis is the subtle yet significant negative correlation between the total views a production received and the proportion of those views attributed to production staff members. This finding suggests that as a production gains popularity and accumulates more views overall, the share of those views directed towards the individuals involved in the production process (directors, cinematographers, editors, and other behind-the-scenes contributors) tends to decrease.

The phenomenon found in the correlation matrix aligns with the earlier discovery that the majority of views are typically garnered by the movies themselves and the actors featured in them. As audiences engage with a production, their attention naturally gravitates towards the final product and the on-screen talent, potentially overshadowing the crucial contributions of the individuals who bring the creative vision to life.
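This relationship can be quantified directly. As a sketch, using hypothetical per-film aggregates (the `total_views` and `production_views` columns below are illustrative, not the notebook's actual frames):

```python
import pandas as pd

# Hypothetical per-film aggregates: total pageviews and the portion
# going to production-staff pages (values are made up for illustration)
films = pd.DataFrame({
    'film': ['A', 'B', 'C', 'D'],
    'total_views': [500_000, 120_000, 60_000, 20_000],
    'production_views': [5_000, 2_400, 1_800, 900],
})

# Share of a film's attention captured by its production staff
films['production_share'] = films['production_views'] / films['total_views']

# Pearson correlation between overall popularity and production share
corr = films['total_views'].corr(films['production_share'])
print(round(corr, 3))  # negative: bigger films, smaller staff share
```

A negative coefficient here mirrors the pattern in the correlation matrix: the more views a production draws, the smaller the fraction reaching its behind-the-scenes contributors.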

Timeseries Graphs

One of the primary concerns of the paper was the effect the Academy Awards would have on its winners. A simple timeseries graph plotting the view count of each title across two years was created (see Fig. 5), and while it contains a wealth of information, for the purposes of illustration the discussion here will focus on two films and their respective actors and production staff, at least those who won in their categories. First is Everything Everywhere All at Once, the Best Picture winner of the 95th Academy Awards. With the exception of Ke Huy Quan, the actors were receiving monthly views in the hundreds of thousands, which on its own is quite large but small relative to March 13, 2023 (see Fig. 8). This supports the initial assumption of the significant attention the Oscars offer. Ke Huy Quan serves as an excellent example of this effect: prior to the film he was practically absent from the industry, while his co-stars were already well-established figures in Hollywood. He started with views in the single digits in 2022, hit the highest number of views among all his co-stars, and in the following days was receiving activity similar to theirs as well.

A similar trend was found with The Whale, with both Brendan Fraser and the film itself following a similar trajectory in pageviews: both experienced a significant spike in views on the day of the film's release, beaten only by the day of the Academy Awards (see Fig. 7). Both films saw a slow rise in views following their nominations, peaked on the day of the Oscars, and then declined rapidly in the days after. Even the average views of each title increased in the week after the Oscars. A notable exception was the film Women Talking: it was the only title to show a negative difference when comparing its average views after the Oscars to those before. One week prior to the Oscars the film averaged about thirty-four thousand views per day; afterwards it averaged roughly one thousand seven hundred fewer.

The examination of The Whale also supported some of the observations from the correlation matrix regarding production staff and view counts. For example, for Annemarie Bradley, a make-up artist on The Whale, a quick comparison of views across time showed that in spite of the bump from the Oscar win, it was significantly smaller than the film's. Even comparing The Whale's winning production staff members, Judy Chin, Annemarie Bradley, and Adrien Morot, against the film itself, the jump was from a high of around ten thousand views to The Whale's and Brendan Fraser's hundred thousand views, peaking at around three million. This example underscores the observed trend, highlighting the disparity in visibility between on-screen and behind-the-scenes talent.

Z-Score Table

Table 10 below shows the z-score of the pageviews for both nominations and awarding days of each page. Z-scores higher than 3 suggest that the day of the event is an "outlier" or "significantly different" from the other days in the month.

Table 10. Z-Score Table
In [33]:
z_df
Out[33]:
Nominations Awarding
95th_Academy_Awards 12.755634 30.617676
A24 1.718809 16.466546
Adrien_Morot 4.371992 31.806244
All_Quiet_on_the_Western_Front_(2022_film) 5.138306 20.339506
An_Irish_Goodbye 14.171374 11.092046
Avatar:_The_Way_of_Water -0.648066 4.052241
Black_Panther:_Wakanda_Forever 0.348064 5.730825
Brendan_Fraser 1.160336 13.663121
Chandrabose_(lyricist) 0.952919 13.609512
Charlie_Mackesy -0.186678 16.400760
Chris_Burdon 1.134733 33.549936
Daniel_Barrett_(visual_effects_supervisor) 1.083357 26.363566
Daniel_Roher 4.038545 28.532929
Daniels_(directors) 2.784789 20.625195
Edward_Berger 5.997203 31.319557
Eric_Saindon 4.046397 4.684864
Everything_Everywhere_All_at_Once 3.484714 16.170656
Guillermo_del_Toro -0.243263 22.288759
Guillermo_del_Toro's_Pinocchio 0.527514 16.944542
Guneet_Monga 3.841287 9.340992
Hauschka 4.295989 38.465112
James_Mather_(sound_editor) 3.531734 23.772698
Jamie_Lee_Curtis 1.836097 18.955645
Joe_Letteri 4.391260 35.033553
Ke_Huy_Quan 0.159509 14.190561
M._M._Keeravani -0.077411 9.824345
Mark_Gustafson -0.297735 4.667252
Michelle_Yeoh 0.563599 14.309144
Miriam_Toews 3.533487 8.961985
Navalny_(film) 4.863658 30.284319
Paul_Rogers_(film_editor) -1.367073 1.716332
RRR 0.832891 14.948348
Richard_Baneham 12.081255 31.630784
Ruth_E._Carter 3.793557 39.153387
Sarah_Polley 5.697720 26.075422
The_Boy,_the_Mole,_the_Fox_and_the_Horse_(film) 4.366274 20.947364
The_Elephant_Whisperers 3.473158 15.055387
The_Whale_(2022_film) 1.968665 10.318191
Top_Gun:_Maverick 0.667426 9.117375
Women_Talking_(film) 5.701720 8.801358
Table 11. Z-Score Table Summary Statistics
In [34]:
z_df.describe()
Out[34]:
Nominations Awarding
count 40.000000 40.000000
mean 3.162344 18.745701
std 3.487761 10.268590
min -1.367073 1.716332
25% 0.641470 10.194730
50% 3.128974 16.433653
75% 4.367704 26.905907
max 14.171374 39.153387

Based on the summary statistics of the z-score table (Table 11), the Nominations spike was significant on average, but it is eclipsed by the Awarding spike, which was over 18 standard deviations away from the mean of March 2023. We also see that the standard deviation of the z-scores is greater during the awarding, which suggests that the individual spikes of the different pages did not share the same magnitude.

The winner for Best Costume Design, Ruth E. Carter for Black Panther: Wakanda Forever, got the highest z-score during the awarding ceremony at around 39 standard deviations away, while the lowest spike for the same day belonged to Paul Rogers, the winner of Best Film Editing for Everything Everywhere All at Once. However, we should also note that the z-scores are affected by the mean and standard deviation of the distribution of pageviews for the rest of the month, and do not depend solely on the total number of pageviews for those days.

Back to Table of Contents

Conclusions and Insights


In summary, the findings of the research show that the Oscars have a significant effect on the Wikipedia views of the parties concerned, specifically the actors, production staff, and the films themselves. The data shows that the actors, whose average views were already quite high, experienced a substantial increase in views on Oscars day. The respective films they were featured in followed a similar trend in their views as well. These findings supported the initial assumption that the Oscars have a significant impact, with the average views increasing after the nominations compared to the time before. However, another aspect of the findings revealed that not all members of the team are recognized with the same level of activity. While actors and films received considerable attention, other production staff members did not experience the same spike in interest. This disparity highlights the varying levels of public recognition within the industry and suggests that further research could explore ways to bring more balanced attention to all contributors of a film.

Back to Table of Contents

Recommendations


The insights generated from this data analysis have sufficiently answered the problem stated earlier. However, there are some improvements that can be made by future researchers of this matter.

Of paramount importance is ensuring the proper gathering of the dataset. As found earlier, multiple dates were not included in our analysis due to errors in reading the respective pageview dumps. Additionally, filtering by page titles did not give complete results for the chosen range because of possible revisions by Wikipedia contributors. However, acquiring and filtering by pageid poses its own challenge, since it will not capture users of the mobile app platform (pageids are null for this access mode).
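Missing dates of the kind described above can be flagged before any analysis starts. A small sketch (assuming only a datetime column of daily observations; the toy dates below are illustrative):

```python
import pandas as pd

# Toy daily series with two days missing (03-03 and 03-06)
observed = pd.to_datetime(['2023-03-01', '2023-03-02', '2023-03-04',
                           '2023-03-05', '2023-03-07'])

# Full expected calendar over the observed span
expected = pd.date_range(observed.min(), observed.max(), freq='D')

# Dates present in the calendar but absent from the data
missing = expected.difference(observed)
print(list(missing.strftime('%Y-%m-%d')))
```

Running the same check against the loaded pageview frame would surface exactly which dump files failed to parse.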

The use of Spark made it possible to perform distributed computing on the large dataset. However, for this study, only one machine with four logical cores was used. To speed up the process, cluster computing with multiple machines can be employed, distributing the partitions across more executors.
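Moving from a single machine to a cluster is largely a session-configuration change. A hypothetical sketch follows; the master URL, executor counts, and memory settings are placeholders, not values used in this study:

```python
from pyspark.sql import SparkSession

# Point the session at a standalone cluster master instead of the
# single local machine; all resource figures are illustrative only.
spark = (SparkSession.builder
         .master('spark://cluster-host:7077')        # was: local[4]
         .config('spark.executor.instances', '8')    # more executors
         .config('spark.executor.memory', '8g')      # per-executor memory
         .appName('pageview-preprocessing')
         .getOrCreate())
```

With this configuration fragment, the same preprocessing code runs unchanged while Spark spreads the parquet partitions across the cluster's executors.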

Future research can also expound on the study by pairing it with clickstream data to understand the source of pageviews. Pageviews coming from neighbor pages can explain the possible connections between pages and show the behavior of Wikipedia users. Additionally, future studies can challenge the assumptions and findings here by exploring the nominees as well as the winners of the Oscars.

Back to Table of Contents

References


  • Pageview complete dumps. (n.d.). Wikimedia.Org. Retrieved May 15, 2024, from https://dumps.wikimedia.org/other/pageview_complete/readme.html
  • Wikipedia contributors. (2024, May 6). 95th Academy Awards. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=95th_Academy_Awards&oldid=1222580018

Back to Table of Contents

Acknowledgements


In preparing this technical paper, OpenAI's ChatGPT was employed to help rephrase sentences and improve the structure, clarity, and readability of the document. The tool did not serve as a primary source of information but was used to enhance the presentation of the research conducted.

The team would also like to acknowledge Prof. Christian Alis for his mentorship and patience in making this study possible.

Back to Table of Contents